---
layout: ../layouts/Layout.astro
title: "A Status Check on Current Vision-Language Models in Text Recognition and Understanding"
description: Project page for TRUE
favicon: favicon.svg
thumbnail: screenshot-light.png
---

import Header from "../components/Header.astro";
import Video from "../components/Video.astro";
import HighlightedSection from "../components/HighlightedSection.astro";
import SmallCaps from "../components/SmallCaps.astro";
import Figure from "../components/Figure.astro";
import Image from "../components/Image.astro";
import TwoColumns from "../components/TwoColumns.astro";
import YouTubeVideo from "../components/YouTubeVideo.astro";
import LaTeX from "../components/LaTeX.astro";

import { ImageComparison } from "../components/ImageComparison.tsx";

import outside from "../assets/outside.mp4";
import transformer from "../assets/transformer.webp";
import Splat from "../components/Splat.tsx"
import dogsDiffc from "../assets/dogs-diffc.png"
import dogsTrue from "../assets/dogs-true.png"
import textheatmap from "../assets/ocr_heatmap.png"
import editedExamples from "../assets/edited_examples.png"

import diffDoc from "../assets/edited-acc-drop.png"
import diffScene from "../assets/accuracy_diff_Scene.png"
import book from "../assets/book.png"
import food from "../assets/food.png"
import doc from "../assets/docvqa.png"
import fb from "../assets/fb.png"
import gen from "../assets/gen.jpg"
import hw from "../assets/hw.jpg"
import receipt from "../assets/receipt.png"
import scene from "../assets/scene.jpg"
import diet from "../assets/diet.png"
import chart from "../assets/chart.png"


import CodeBlock from "../components/CodeBlock.astro";
import Table from "../components/Table.astro";
export const components = {pre: CodeBlock, table: Table}

<Header
  title={frontmatter.title}
  authors={[
    {
      name: "BAAI FlagEval Team",
      url: "",
      notes: [""],
    },
  ]}
  links={[
    {
      name: "Paper",
      url: "",
      icon: "ri:file-pdf-2-line",
    },
    {
      name: "Code",
      url: "https://github.com/flageval-baai/FlagEvalMM",
      icon: "ri:github-line",
    },
    {
      name: "arXiv",
      url: "",
      icon: "academicons:arxiv",
    },
    {
      name: "Dataset",
      url: "https://huggingface.co/datasets/BAAI/TRUE",
      icon: "simple-icons:huggingface",
      iconColor: "#FFD21E",
    }
  ]}
  />


<HighlightedSection>

## Abstract

Recent vision-language models (VLMs) have demonstrated impressive performance in text recognition and understanding, as shown by the metrics on a number of text-centric benchmarks. In this study, we take a closer look at the current success. Following popularly used relevant benchmarks, we conduct more analysis on re-collected and edited data. We find that while modern VLMs are indeed showing strong text recognition and understanding capabilities, the strength might have been slightly over-estimated for some models with the risk of benchmark saturation and overfitting. We discuss the implications and prepare for TRUE, our new benchmark for Text Recognition and Understanding Evaluation, with regular updates in mind. The first version of our benchmark confirmed the huge advantage of very recently released top-tier VLMs, while also showing a bit room for further improvement. We hope our analysis and benchmark updates could help contribute to the development and evaluation of relevant progress in the near future.
</HighlightedSection>

## A status check

We conduct our analysis by using popular benchmarks as reference.
For text recognition, we attempt at a replication of the recognition subsets in <a href="https://arxiv.org/abs/2305.07895">OCRBench</a>. In brief, we <strong>strictly follow the data collection process</strong> of the source dataset. The performance of the current top-tier VLMs in Figure 1 shows distributional overfitting on text recognition tasks.

<Figure>
  <Image slot="figure" source={textheatmap} altText="Heatmap of the text recognition between OCRBench and TRUE." />
  <span slot="caption">Figure1: Heatmap of the text recognition performance of different models on different subsets of OCRBench.</span>
</Figure>


For text understanding, we conduct minimal perturbations on DocVQA and TextVQA, which means that only the area of target answer is changed, and any other visual element that has been semantically related to the old answer has been simultaneously modified. Two edited examples are shown in Figure 2:

<Figure>
  <Image slot="figure" source={editedExamples} altText="Edited examples from TextVQA and DocVQA." />
  <span slot="caption">Figure 2: Two edited examples from TextVQA and DocVQA. </span>
</Figure>

Performance drops on edited images show that: 

(1) Accuracy descent exists on both DocVQA and TextVQA.

(2) Text recognition and understanding in richer textual context(DocVQA) might be slightly more robust than recognizing text from scenes with very little textual context(TextVQA).

<Figure>
  <Image slot="figure" source={diffDoc} altText="Diagram of the performance difference before and after editing." />
  <span slot="caption">Figure 3: Accuracy drop on minimally edited images from TextVQA and DocVQA.</span>
</Figure>


## New collected benchmark

Based on our preliminary findings, We assemble our re-collection of new data as the first batch of our new benchmark, named Text Recognition and Understanding Evaluation suite (TRUE).
The first version of our benchmark includes 1,146 image-text pairs (a subset of 573 forming a more challenging hard set), including:
<p><strong>SceneOCR:</strong> general scene-text recognition</p>
<p><strong>HW: </strong> Handwriting recognition on paper, boards, or digital pages</p>
<p><strong>SceneVQA: </strong> Scene-text understanding (VQA)</p>
<p><strong>DocumentVQA: </strong> Document understanding (VQA)</p>
<p><strong>ChartInfo: </strong> Chart and infographics understanding (VQA)</p>
<p><strong>Receipt VQA: </strong> Receipt understanding</p>
<p><strong>Food VQA: </strong> Food ingredients understanding on product packages</p>
<p><strong>FB: </strong> Recognizing fake brands (recognition vs. language bias)</p>
<p><strong>Book: </strong> Books on the Bookshelf (naturally rotated text)</p>
<p><strong>Diet: </strong> Dietary VQA (understanding beyond ingredient extraction)</p>

<TwoColumns>
  <Figure slot="left">
      <span slot="title">Receipt understanding</span>
      <Image slot="figure" source={receipt} altText="receipt" />
      <span slot="caption">Q: What is the total amount of this receipt? Answer this question using the text in the image directly. A: 31.91</span>
  </Figure>
  <Figure slot="right">
    <span slot="title">Food ingredients understanding on product packages</span>
    <Image slot="figure" source={food} altText="food" />
    <span slot="caption">Q: what is the value for Carbohydrates Per Serving? Answer this question using the text in the image directly. A: 3g</span>
  </Figure>
</TwoColumns>

<TwoColumns>
  <Figure slot="left">
      <span slot="title">Fake Brand</span>
      <Image slot="figure" source={fb} altText="fb" />
      <span slot="caption">Q: Identify the brand shown in the image and provide your answer exactly as it appears. A: PolyStation</span>
  </Figure>
    <Figure slot="right">
    <span slot="title">Books on the Bookshelf</span>
    <Image slot="figure" source={book} altText="book" />
    <span slot="caption">Q:What is the title of the book written by BELTING in the image? Answer this question using the text in the image directly. A: FACE AND MASK</span>
  </Figure>
</TwoColumns>

<TwoColumns>
  <Figure slot="left">
      <span slot="title">SceneOCR</span>
      <Image slot="figure" source={gen} altText="gen" />
      <span slot="caption">Q: Extract all text you can find in the image. A: ["GALLOP", "AIR"]</span>
  </Figure>
  <Figure slot="right">
    <span slot="title">Handwriting recognition</span>
    <Image slot="figure" source={hw} altText="hw" />
    <span slot="caption">Q: Extract all text content from the image. A: ["wednesday, nine of june 2010.", "Natalia"]</span>
  </Figure>
</TwoColumns>

<TwoColumns>
  <Figure slot="left">
      <span slot="title">Scene-text understanding</span>
      <Image slot="figure" source={scene} altText="scene" />
      <span slot="caption">Q: On what date is the festival scheduled featuring Gwada Mike? A: ["July", "22", "2023"]</span>
  </Figure>
    <Figure slot="right">
      <span slot="title">Dietary VQA</span>
      <Image slot="figure" source={diet} altText="diet" />
      <span slot="caption">Is the product in the image nut-free? If the answer is No, please specify the questionable ingredients. Otherwise please answer 'Yes'. A: Yes</span>
  </Figure>
</TwoColumns>

<TwoColumns>
  <Figure slot="left">
    <span slot="title">Document understanding</span>
    <Image slot="figure" source={doc} altText="doc" />
    <span slot="caption">Q: What is the purpose of the attached handouts mentioned in the email? A: self study module</span>
  </Figure>
  <Figure slot="right">
    <span slot="title">Chart and infographics understanding</span>
    <Image slot="figure" source={chart} altText="chart" />
    <span slot="caption">Q: What is the combined retail sales share of IKEA worldwide in fiscal year 2019 for the categories of Living room, Children's IKEA, and IKEA food? A: 29%</span>
  </Figure>
</TwoColumns>


## Model performance on hard subset of TRUE
|            Model         | All | HW | SceneOCR | FakeBrands | BookVQA | DietVQA | Chart&Info | SceneVQA |
| :----------------------: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| gemini-2.5-pro-preview-03-25 | **81.6** | **88.2** | 68.4 | 67.9 | **95.4** | 68.8 | **83.6** | **80.3** |
| gemini-2.0-pro-exp | 75.0 | 73.7 | **73.7** | **75.0** | 82.9 | **70.5** | 72.6 | 69.0 |
| gemini-2.0-flash-exp | 75.0 | **89.5** | 54.4 | 67.9 | 91.1 | 65.2 | 72.6 | 69.0 |
| gpt-4o-2024-11-20 | 66.9 | 60.5 | 70.2 | 57.1 | 78.7 | 60.7 | 54.8 | 64.8 |
| claude-3-7-sonnet-20250219 | 51.7 | 60.5 | 35.1 | 28.6 | 38.5 | 61.6 | 75.3 | 62.0 |
| Qwen2.5-VL-72B-Instruct | **64.9** | **76.3** | **70.2** | 57.1 | 55.8 | 68.8 | 72.6 | 66.2 |
| InternVL2\_5-78B | 60.9 | 50.0 | 45.6 | 50.0 | **72.4** | 63.4 | 57.5 | 54.9 |
| Qwen2.5-VL-32B-Instruct | 59.9 | 57.9 | 49.1 | 35.7 | 50.0 | 65.2 | **80.8** | **73.2** |
| Pixtral-Large-Instruct-2411 | 57.1 | 50.0 | 45.6 | 42.9 | 53.5 | **73.2** | 72.6 | 43.7 |
| Mistral-Small-3.1-24B | 51.2 | 73.7 | 31.6 | 17.9 | 47.7 | 54.5 | 72.6 | 49.3 |
| Qwen2.5-VL-7B-Instruct | 49.4 | 55.3 | 56.1 | **60.7** | 50.0 | 30.4 | 56.2 | 57.8 |
| Molmo-72B-0924 | 47.9 | 36.8 | 38.6 | 35.7 | 51.9 | 52.7 | 46.6 | 50.7 |
| Meta-Llama-3.2-90B-Vision | 41.5 | 21.1 | 54.4 | 21.4 | 55.7 | 28.6 | 49.3 | 31.0 |
| Pixtral-12B-2409 | 41.0 | 39.5 | 24.6 | 28.6 | 33.3 | 56.2 | 67.1 | 28.2 |
| MiniCPM-o-2\_6 | 40.7 | 55.3 | 42.1 | 17.9 | 35.1 | 36.6 | 58.9 | 42.2 |
| InternVL2\_5-8B | 38.0 | 26.3 | 38.6 | 25.0 | 39.7 | 36.6 | 46.6 | 38.0 |
| Molmo-7B-D-0924 | 32.8 | 18.4 | 40.4 | 35.7 | 34.2 | 23.2 | 39.7 | 38.0 |
| Qwen2.5-VL-3B-Instruct | 32.7 | 57.9 | 28.1 | 25.0 | 29.3 | 28.6 | 37.0 | 36.6 |
| llava-onevision-qwen2-72b | 31.8 | 10.5 | 42.1 | 21.4 | 0.6 | 62.5 | 50.7 | 47.9 |
| llava-onevision-qwen2-7b | 20.4 | 13.2 | 26.3 | 17.9 | 4.6 | 36.6 | 26.0 | 28.2 |
| Idefics3-8B-Llama3 | 16.6 | 5.3 | 3.5 | 7.1 | 17.2 | 22.3 | 21.9 | 21.1 |
| Meta-Llama-3.2-11B-Vision | 15.9 | 0.0 | 3.5 | 7.1 | 39.1 | 2.7 | 15.1 | 2.8 |
| Phi-4-multimodal-instruct | 15.4 | 2.6 | 10.5 | 14.3 | 3.5 | 20.5 | 35.6 | 26.8 |


## BibTeX citation

```bibtex
@misc{baaiflageval2025true,
  author = "{BAAI FlagEval Team}",
  title = "A Status Check on Current Vision-Language Models in Text Recognition and Understanding",
  year = "2025",
  howpublished = "https://flageval-baai.github.io/TRUE/",
}
```